When you have the foundation in place, as described in the preceding section, you can move on to building a tailored software-driven high-availability solution. Which HA option (or options) you should use depends on your HA requirements. The following high-availability options are used both individually and, very often, together to achieve different levels of HA.
All these options are readily available “out of the box” from Microsoft, in the Windows Server family of products and in Microsoft SQL Server 2008.
It is important to understand that some of these options can be used together, but not all of them can be combined. For example, you might use Microsoft Cluster Services (MSCS) along with Microsoft SQL Server 2008’s SQL Clustering to implement a clustered database configuration, whereas you wouldn’t necessarily need MSCS for database mirroring.
Microsoft Cluster Services (MSCS)
MSCS could actually be
considered a part of the basic HA foundation components described
earlier, except that it’s possible to build a high-availability system
without it (for example, a system that uses numerous redundant hardware
components and disk mirroring or RAID for its disk subsystem). Microsoft
has made MSCS the cornerstone of its clustering capabilities, and MSCS
is utilized by applications that are cluster enabled. A prime example of
a cluster-enabled technology is Microsoft SQL Server 2008.
MSCS is the advanced
Windows operating system configuration that defines and manages between 2
and 16 servers as “nodes” in a cluster. These nodes are aware of each
other and can be set up to take over cluster-aware applications from any
node that fails (for example, a failed server). This cluster
configuration also shares and controls one or more disk subsystems as
part of its high-availability capability. Figure 1 illustrates a basic two-node MSCS configuration.
MSCS is available only with the Microsoft Windows Server Enterprise and Datacenter editions. Don’t be alarmed, though: if you are considering a high-availability system in the first place, there is a good chance that your applications are already running on these enterprise-level OS versions.
MSCS can be set up in active/passive or active/active mode. Essentially, in active/passive mode, one server sits idle (that is, it is passive) while the other does the work (that is, it is active). If the active server fails, the passive one takes over the shared disk and the cluster-aware applications almost instantaneously.
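The active/passive takeover logic can be sketched as a toy model; this is an illustration only, with hypothetical node names, and is nothing like the actual MSCS implementation:

```python
# Toy model of MSCS-style active/passive failover (illustrative only).
# Node names and the heartbeat mechanism are simplified stand-ins.

class ClusterNode:
    def __init__(self, name):
        self.name = name
        self.alive = True

class ActivePassiveCluster:
    def __init__(self, active, passive):
        self.active = active
        self.passive = passive

    def heartbeat_check(self):
        """If the active node has failed, the passive node takes over
        the shared disk and the cluster-aware applications."""
        if not self.active.alive:
            self.active, self.passive = self.passive, self.active
        return self.active.name

cluster = ActivePassiveCluster(ClusterNode("NODE1"), ClusterNode("NODE2"))
print(cluster.heartbeat_check())   # NODE1 is healthy and stays active
cluster.active.alive = False       # simulate a failure on the active node
print(cluster.heartbeat_check())   # NODE2 has taken over
```

The key property the sketch captures is that, from the outside, the cluster always answers; only which node answers changes.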
SQL Clustering
If you want a SQL Server instance
to be clustered for high availability, you are essentially asking that
this SQL Server instance (and the database) be completely resilient to a
server failure and completely available to the application without the
end user ever even noticing that there was a failure (or at least with
minimal interruption). Microsoft provides this capability through the
SQL Clustering option. SQL Clustering is built on top of MSCS for its
underlying detection of a failed server and for its availability of the
databases on the shared disk (which is controlled by MSCS). SQL Server
is said to be a “cluster-aware/enabled” technology.
You cluster a SQL Server instance by creating a virtual SQL Server instance that is known to the application (the constant in the equation), backed by two physical SQL Server instances that share one set of databases. In an active/passive configuration, only one SQL Server instance is active at a time and simply goes about its work. If the active server fails (and, with it, the physical SQL Server instance), the passive server (and the physical SQL Server instance on that server) takes over almost instantaneously. This is possible because MSCS also controls the shared disk where the databases reside. The end user and application never really know which physical SQL Server instance they are on or whether one failed. Figure 2 illustrates a typical SQL Clustering configuration built on top of MSCS.
Setup and management of this type of configuration are much easier than you might think, and SQL Clustering has become the method of choice for many high-availability solutions.
Extending the clustering
model to include Network Load Balancing (NLB) pushes this particular
solution even further into higher availability—from client traffic high
availability to back-end SQL Server high availability. Figure 3
shows a four-host NLB cluster architecture acting as a virtual server
to handle the network traffic coupled with a two-node SQL cluster on the
back end. This setup is resilient from top to bottom.
The four NLB hosts
work together, distributing the work efficiently. NLB automatically
detects the failure of a server and repartitions client traffic among
the remaining servers.
SQL Clustering in SQL Server 2008 also lets you extend this fault-tolerant solution to embrace more SQL Server instances and all of SQL Server’s related services. This is a big deal because services such as Analysis Services previously had to be handled with separate techniques to achieve near high availability. Not anymore; each SQL Server service is now cluster aware.
Data Replication
The next technology option
that can be utilized to achieve high availability is data replication.
Originally, data replication was created to offload processing from a
very busy server (such as an OLTP application that must also support a
big reporting workload) or to geographically distribute data for
different, very distinct user bases (such as worldwide product ordering
applications). As data replication (transactional replication) became
more stable and reliable, it started to be used to create “warm” (almost
“hot”) standby SQL Servers that could also be used to fulfill basic
reporting needs. If the primary server ever failed, the reporting users
would still be able to work (hence a higher degree of availability
achieved for them), and the replicated reporting database could be used
as a substitute for the primary server, if needed (hence a warm-standby
SQL Server). When doing transactional replication in the “instantaneous replication” mode, all data changes are replicated to the subscriber servers extremely quickly. With SQL Server 2000, updating subscribers allowed for even greater distribution of the workload and, overall, increased the availability of the primary data and distributed the update load across the replication topology. There are, however, plenty of issues and complications involved in the updating subscribers approach (for example, conflict handlers and queues).
With SQL Server 2005, Microsoft introduced peer-to-peer replication, which is not a publisher/subscriber model but a publisher-to-publisher model (hence peer-to-peer). It is a lot easier to configure and manage than other replication topologies, but it still has its nuances to deal with. The peer-to-peer model allows excellent availability for the data and great distribution of workload along geographic (or other) lines. This may fit some companies’ availability requirements and fulfill their distributed reporting requirements as well.
The top of Figure 4
shows a typical SQL data replication configuration of a central
publisher/subscriber using continuous transactional replication. This
can serve as a basis for high availability and also fulfills a reporting
server requirement at the same time. The bottom of Figure 4 shows a typical peer-to-peer continuous transactional replication model that is also viable.
The downside of this replication approach comes into play if the subscriber (or the other peer) ever needs to become the primary server (that is, take over the work from the original server). This takes a bit of administration that is not transparent to the end user: connection strings have to be changed, ODBC data sources need to be updated, and so on. But this process may take minutes, as opposed to hours of database recovery time, and it may well be tolerable to end users. Peer-to-peer configurations handle recovery a bit better in that much of the workload is already distributed across the nodes. So, at most, only part of the user base is affected if one node goes down. Those users can easily be redirected to the other node (peer) with the same type of connection changes described earlier.
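The redirection step can be sketched as a small helper that rewrites the Data Source in a connection string; the server names and the helper itself are hypothetical, for illustration only:

```python
# Illustrative sketch of the manual redirection step after a failover:
# rewriting the Data Source in an ODBC/ADO-style connection string.

def redirect(connection_string, old_server, new_server):
    """Point a connection string at the surviving peer."""
    parts = []
    for part in connection_string.split(";"):
        key, _, value = part.partition("=")
        if key.strip().lower() == "data source" and value.strip() == old_server:
            part = f"{key}={new_server}"
        parts.append(part)
    return ";".join(parts)

conn = "Data Source=PEER1;Initial Catalog=Orders;Integrated Security=SSPI"
print(redirect(conn, "PEER1", "PEER2"))
# Data Source=PEER2;Initial Catalog=Orders;Integrated Security=SSPI
```

In practice this change may live in application config files or ODBC data source definitions rather than in code, which is exactly why the step is not transparent to end users.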
With
either the publisher/subscriber or peer-to-peer replication approach,
there is a risk of not having all the transactions from the publishing
server. However, often, a company is willing to live with this small
risk in favor of availability. Remember that a replicated database is an
approximate image of the primary database (up to the point of the last
update that was successfully distributed), which makes it very
attractive as a warm standby. For publishing databases that are
primarily read-only, using a warm standby is a great way to distribute
the load and mitigate the risk of any one server failing.
Log Shipping
Another, more direct,
method of creating a completely redundant database image is to utilize
log shipping. Microsoft “certifies” log shipping as a method of creating
an “almost hot” spare. Some folks even use log shipping as an
alternative to data replication (it has been referred to as “the poor
man’s data replication”). There’s just one problem: Microsoft has
formally announced that log shipping (as we know and love it) will be
deprecated in the near future. The reasons are many, but the primary one
is that it is being replaced by database mirroring (referred to as real-time log shipping, when it was first being conceived). If you still want to use log shipping, it is perfectly viable—for now.
Log shipping does three primary things:
- Makes an exact image copy of a database on one server from a database dump
- Creates a copy of that database on one or more other servers from that dump
- Continuously applies transaction log dumps from the original database to the copy
In other words,
log shipping effectively replicates the data of one server to one or
more other servers via transaction log dumps. Figure 5 shows a source/destination SQL Server pair that has been configured for log shipping.
Log shipping is a great
solution when you have to create one or more failover servers. It turns
out that, to some degree, log shipping fits the requirement of creating a
read-only subscriber as well. The following are the gating factors for
using log shipping as a method of creating and maintaining a redundant
database image:
- Data latency lag: the time between the transaction log dumps on the source database and when those dumps are applied to the destination databases.
- Sources and destinations must be the same SQL Server version.
- Data is read-only on the destination SQL Server until the log shipping pairing is broken (as it should be, to guarantee that the transaction logs can be applied to the destination SQL Server).
The data latency
restriction might quickly disqualify log shipping as an instantaneous
high-availability solution (if you need rapid availability of the
failover server). However, log shipping might be adequate for certain
situations. If a failure ever occurs on the primary SQL Server, a
destination SQL Server that was created and maintained via log shipping
can be swapped into use fairly quickly. The destination SQL Server would
contain exactly what was on the source SQL Server (right down to every
user ID, table, index, and file allocation map, except for any changes
to the source database that occurred
after the last log dump was applied). This directly achieves a level of
high availability. It is still not completely transparent, though,
because the SQL Server instance names are different, and the end user
may be required to log in again to the new server instance.
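The worst-case exposure can be estimated with simple arithmetic; the numbers below are hypothetical, assuming log dumps every 15 minutes and roughly 5 minutes to copy and restore each one:

```python
# Illustrative back-of-the-envelope arithmetic: with log dumps taken every
# dump_interval minutes and a copy/restore delay on the destination, the
# worst-case data latency (changes at risk in a failover) is roughly the
# dump interval plus the copy/restore delay.

def worst_case_latency_minutes(dump_interval, copy_restore_delay):
    return dump_interval + copy_restore_delay

print(worst_case_latency_minutes(15, 5))  # 20
```

If 20 minutes of potential data loss is unacceptable for your application, that single number disqualifies log shipping before any other consideration.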
Database Mirroring
Another failover option with
SQL Server is database mirroring. Database mirroring essentially
extends the old log shipping feature of SQL Server and creates an
automatic failover capability to a “hot” standby server. Database
mirroring is being billed as creating a fault-tolerant database that is
an “instant” standby (ready for use in less than three seconds).
At the heart of database mirroring is “copy-on-write” technology. Copy-on-write means that transactional changes are shipped to another server as the logs are written; all logged changes to the database instance become immediately available for copying to another location. As you can see in Figure 6, database mirroring utilizes a witness server as well as client components to insulate the client applications from any knowledge of a server failure.
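The witness’s role in deciding on automatic failover can be sketched roughly as follows; this is a simplification of the actual mirroring quorum rules, not Microsoft’s implementation:

```python
# Illustrative sketch of why a witness matters: the mirror promotes itself
# automatically only when it AND the witness agree the principal is down
# (a majority of the three servers), which guards against split-brain.

def should_fail_over(mirror_sees_principal, witness_sees_principal):
    """Return True if the mirror may become the new principal."""
    # If the witness can still see the principal, the mirror's lost
    # connection is probably a network problem, not a server failure.
    return not mirror_sees_principal and not witness_sees_principal

print(should_fail_over(False, False))  # True  -> automatic failover
print(should_fail_over(False, True))   # False -> no failover (likely a network split)
```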
Combining Failover with Scale-Out Options
SQL Server 2008 encourages combinations of options to achieve higher availability levels. A prime example is combining data replication with database mirroring to provide maximum availability of data, scalability to users, and fault tolerance via failover, potentially at each node in the replication topology. Starting with the publisher, and perhaps the distributor, you can make each of them a database mirroring failover configuration.
Building up a combination of both options is essentially the best of both worlds: the super-low latency of database mirroring for fault tolerance, plus high availability (and scalability) of data through replication.